## [1] 4e-04
## [1] 0.03358333
Not stationary (why?), so difference by 1 lag.
And if we difference by 2 lags:
Plot ACF and PACF
## [1] 0.92
## [1] "No"
## [1] 0.97
## [1] "Yes"
## [1] 0.98
## [1] "Yes"
## [1] 0.96
## [1] "Yes"
Let’s try different ARMA(p,q) models for Dx:
## initial value -5.517013
## iter 2 value -5.610226
## iter 3 value -5.639102
## iter 4 value -5.667613
## iter 5 value -5.680920
## iter 6 value -5.686832
## iter 7 value -5.688360
## iter 8 value -5.690604
## iter 9 value -5.691425
## iter 10 value -5.691657
## iter 11 value -5.691992
## iter 12 value -5.693876
## iter 13 value -5.694710
## iter 14 value -5.695386
## iter 15 value -5.695689
## iter 16 value -5.695873
## iter 17 value -5.695883
## iter 18 value -5.695895
## iter 19 value -5.695895
## iter 20 value -5.695895
## iter 20 value -5.695895
## iter 20 value -5.695895
## final value -5.695895
## converged
## initial value -5.526162
## iter 2 value -5.543360
## iter 3 value -5.552982
## iter 4 value -5.559251
## iter 5 value -5.561630
## iter 6 value -5.564113
## iter 7 value -5.565733
## iter 8 value -5.566584
## iter 9 value -5.567171
## iter 10 value -5.567757
## iter 11 value -5.567902
## iter 12 value -5.568102
## iter 13 value -5.568339
## iter 14 value -5.568562
## iter 15 value -5.568747
## iter 16 value -5.568992
## iter 17 value -5.569225
## iter 18 value -5.569312
## iter 19 value -5.569330
## iter 20 value -5.569338
## iter 21 value -5.569340
## iter 22 value -5.569340
## iter 23 value -5.569340
## iter 23 value -5.569340
## iter 23 value -5.569340
## final value -5.569340
## converged
## $fit
##
## Call:
## stats::arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D,
## Q), period = S), xreg = constant, optim.control = list(trace = trc, REPORT = 1,
## reltol = tol))
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ar6 ar7
## -0.5590 -0.0396 0.0019 -0.0169 -0.0311 -0.1607 -0.1924
## s.e. 0.2401 0.1043 0.0997 0.1004 0.1035 0.0994 0.1110
## ar8 ar9 ar10 ar11 ar12 ar13 ar14 ar15
## -0.2988 -0.3405 -0.0914 0.0819 0.0875 0.0427 -0.1522 -0.2951
## s.e. 0.1086 0.1131 0.1076 0.1051 0.1045 0.1057 0.1058 0.1162
## ar16 ma1 constant
## -0.3723 0.6799 -1e-04
## s.e. 0.0968 0.2539 2e-04
##
## sigma^2 estimated as 1.405e-05: log likelihood = 473.15, aic = -908.29
##
## $degrees_of_freedom
## [1] 97
##
## $ttable
## Estimate SE t.value p.value
## ar1 -0.5590 0.2401 -2.3277 0.0220
## ar2 -0.0396 0.1043 -0.3794 0.7053
## ar3 0.0019 0.0997 0.0194 0.9846
## ar4 -0.0169 0.1004 -0.1686 0.8665
## ar5 -0.0311 0.1035 -0.3009 0.7641
## ar6 -0.1607 0.0994 -1.6162 0.1093
## ar7 -0.1924 0.1110 -1.7332 0.0862
## ar8 -0.2988 0.1086 -2.7513 0.0071
## ar9 -0.3405 0.1131 -3.0105 0.0033
## ar10 -0.0914 0.1076 -0.8494 0.3978
## ar11 0.0819 0.1051 0.7787 0.4381
## ar12 0.0875 0.1045 0.8373 0.4045
## ar13 0.0427 0.1057 0.4036 0.6874
## ar14 -0.1522 0.1058 -1.4392 0.1533
## ar15 -0.2951 0.1162 -2.5393 0.0127
## ar16 -0.3723 0.0968 -3.8475 0.0002
## ma1 0.6799 0.2539 2.6776 0.0087
## constant -0.0001 0.0002 -0.5220 0.6028
##
## $AIC
## [1] -9.859923
##
## $AICc
## [1] -9.772967
##
## $BIC
## [1] -10.43028
Looking at the ACF for Dx, we note that there does appear to be seasonality. Each flu season has 23 weeks, and we observe spikes in the ACF every 23 weeks. So let’s try to account for seasonality.
## [1] 0.9333333
## [1] "No"
## [1] 0.9666667
## [1] "Yes"
Let’s try to fit a SARIMA model for D23x: Using good/bad to say whether final coefficients are significant. i.e. if the model is arima(2,1,3), I only need the AR(2) and MA(3) coefficients to be significant. Also, for better reasoning on choosing p,d,q,P,D,Q,S, look further down the page for where we fit a model to the data from vaccinated respondents.
## initial value -5.536587
## iter 2 value -5.678612
## iter 3 value -5.734583
## iter 4 value -5.752971
## iter 5 value -5.759835
## iter 6 value -5.760877
## iter 7 value -5.765845
## iter 8 value -5.772118
## iter 9 value -5.776981
## iter 10 value -5.780535
## iter 11 value -5.781611
## iter 12 value -5.783016
## iter 13 value -5.783849
## iter 14 value -5.784998
## iter 15 value -5.786078
## iter 16 value -5.786756
## iter 17 value -5.787928
## iter 18 value -5.788666
## iter 19 value -5.788950
## iter 20 value -5.789166
## iter 21 value -5.789552
## iter 22 value -5.789836
## iter 23 value -5.790100
## iter 24 value -5.790665
## iter 25 value -5.790914
## iter 26 value -5.791014
## iter 27 value -5.791061
## iter 28 value -5.791080
## iter 29 value -5.791092
## iter 30 value -5.791098
## iter 31 value -5.791104
## iter 32 value -5.791116
## iter 33 value -5.791131
## iter 34 value -5.791147
## iter 35 value -5.791158
## iter 36 value -5.791164
## iter 37 value -5.791170
## iter 38 value -5.791172
## iter 39 value -5.791172
## iter 39 value -5.791172
## iter 39 value -5.791172
## final value -5.791172
## converged
## initial value -5.497575
## iter 2 value -5.525397
## iter 3 value -5.545369
## iter 4 value -5.555749
## iter 5 value -5.555900
## iter 6 value -5.567166
## iter 7 value -5.570839
## iter 8 value -5.576464
## iter 9 value -5.579911
## iter 10 value -5.583920
## iter 11 value -5.593006
## iter 12 value -5.597103
## iter 13 value -5.597992
## iter 14 value -5.603064
## iter 15 value -5.605340
## iter 16 value -5.608235
## iter 17 value -5.608897
## iter 18 value -5.610248
## iter 19 value -5.610554
## iter 20 value -5.610790
## iter 21 value -5.610977
## iter 22 value -5.611261
## iter 23 value -5.611512
## iter 24 value -5.612009
## iter 25 value -5.612373
## iter 26 value -5.612564
## iter 27 value -5.612685
## iter 28 value -5.612775
## iter 29 value -5.612866
## iter 30 value -5.612999
## iter 31 value -5.613079
## iter 32 value -5.613271
## iter 33 value -5.613404
## iter 34 value -5.613458
## iter 35 value -5.613623
## iter 36 value -5.613690
## iter 37 value -5.613762
## iter 38 value -5.613815
## iter 39 value -5.613844
## iter 40 value -5.613870
## iter 41 value -5.613889
## iter 42 value -5.613902
## iter 43 value -5.613909
## iter 44 value -5.613912
## iter 45 value -5.613918
## iter 46 value -5.613922
## iter 47 value -5.613924
## iter 48 value -5.613925
## iter 49 value -5.613925
## iter 50 value -5.613925
## iter 50 value -5.613925
## final value -5.613925
## converged
## $fit
##
## Call:
## stats::arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D,
## Q), period = S), include.mean = !no.constant, optim.control = list(trace = trc,
## REPORT = 1, reltol = tol))
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ar6 ar7 ar8
## -0.0448 -0.0710 -0.0447 0.4365 0.5544 -0.0948 -0.0417 -0.1566
## s.e. 0.4117 0.1974 0.1616 0.1531 0.3045 0.1371 0.1456 0.1322
## ar9 ar10 ar11 ar12 ar13 ar14 ar15 ar16
## -0.1812 0.1396 0.1659 0.0306 0.0580 -0.0612 -0.1571 -0.0416
## s.e. 0.1443 0.1416 0.1865 0.1431 0.1419 0.1371 0.1355 0.1259
## ar17 ar18 ma1 ma2 ma3 ma4 ma5 sma1
## 0.2554 -0.1647 -0.1702 0.0199 0.2131 -0.5105 -0.5523 -0.2570
## s.e. 0.1236 0.1870 0.4253 0.2957 0.1834 0.2070 0.3702 0.1608
##
## sigma^2 estimated as 1.117e-05: log likelihood = 381.74, aic = -713.49
##
## $degrees_of_freedom
## [1] 91
##
## $ttable
## Estimate SE t.value p.value
## ar1 -0.0448 0.4117 -0.1089 0.9135
## ar2 -0.0710 0.1974 -0.3597 0.7199
## ar3 -0.0447 0.1616 -0.2769 0.7825
## ar4 0.4365 0.1531 2.8511 0.0054
## ar5 0.5544 0.3045 1.8207 0.0719
## ar6 -0.0948 0.1371 -0.6913 0.4911
## ar7 -0.0417 0.1456 -0.2863 0.7753
## ar8 -0.1566 0.1322 -1.1849 0.2391
## ar9 -0.1812 0.1443 -1.2557 0.2124
## ar10 0.1396 0.1416 0.9862 0.3267
## ar11 0.1659 0.1865 0.8895 0.3761
## ar12 0.0306 0.1431 0.2136 0.8313
## ar13 0.0580 0.1419 0.4087 0.6837
## ar14 -0.0612 0.1371 -0.4460 0.6566
## ar15 -0.1571 0.1355 -1.1594 0.2493
## ar16 -0.0416 0.1259 -0.3303 0.7419
## ar17 0.2554 0.1236 2.0660 0.0417
## ar18 -0.1647 0.1870 -0.8804 0.3809
## ma1 -0.1702 0.4253 -0.4001 0.6900
## ma2 0.0199 0.2957 0.0672 0.9465
## ma3 0.2131 0.1834 1.1617 0.2484
## ma4 -0.5105 0.2070 -2.4659 0.0155
## ma5 -0.5523 0.3702 -1.4916 0.1393
## sma1 -0.2570 0.1608 -1.5981 0.1135
##
## $AIC
## [1] -9.985156
##
## $AICc
## [1] -9.84075
##
## $BIC
## [1] -10.4123
The very good model was best.
## $pred
## Time Series:
## Start = 116
## End = 125
## Frequency = 1
## [1] 0.02040377 0.01367033 0.01651095 0.02295284 0.02232263 0.02037957
## [7] 0.01966316 0.01978917 0.02059040 0.02122259
##
## $se
## Time Series:
## Start = 116
## End = 125
## Frequency = 1
## [1] 0.003419422 0.004374180 0.005043979 0.005982023 0.006725938
## [6] 0.007285459 0.007552876 0.007843906 0.008100484 0.008197078
Let’s try to fit a model omitting the final 2 values, then predict the next 2 values and compare to the actual ones
## $pred
## Time Series:
## Start = 111
## End = 115
## Frequency = 1
## [1] 0.022127733 0.019324457 0.014278120 0.010703089 0.005199534
##
## $se
## Time Series:
## Start = 111
## End = 115
## Frequency = 1
## [1] 0.003414511 0.004343294 0.005065274 0.006066540 0.006882498
## [1] 0.030 0.029 0.025 0.022 0.020
Do the actual 114th and 115th values of the time series lie within the 95% bounds of our predicted 114th and 115th values? (Make this look nicer)
## [,1] [,2] [,3]
## [1,] 0.015435292 0.030 0.02882017
## [2,] 0.010811601 0.029 0.02783731
## [3,] 0.004350183 0.025 0.02420606
## [4,] -0.001187331 0.022 0.02259351
## [5,] -0.008290162 0.020 0.01868923
Yes (above shows the lower CI, actual value, upper CI) Plotting actual vs predicted:
As you can see from this plot, the next two values the time series (red line) predictsare very close to the actual values (black line). Certainly, the actual values lie well within the 95% confidence bounds (dotted red lines).
Note: We can also predict value for Oct 16, 2016, which we have the data for but omitted so that we had 23 weeks per season.
Let’s take a look at the Google Trends data:
Not stationary. Try differencing.
Fit an ARMA model to the differenced data.
## [1] 0.9823009
## [1] "Yes"
## [1] 0.9823009
## [1] "Yes"
## initial value 0.828263
## iter 2 value 0.814788
## iter 3 value 0.778813
## iter 4 value 0.777724
## iter 5 value 0.776237
## iter 6 value 0.773120
## iter 7 value 0.770630
## iter 8 value 0.767976
## iter 9 value 0.766622
## iter 10 value 0.765858
## iter 11 value 0.765474
## iter 12 value 0.764534
## iter 13 value 0.762100
## iter 14 value 0.760753
## iter 15 value 0.759372
## iter 16 value 0.757814
## iter 17 value 0.757157
## iter 18 value 0.754637
## iter 19 value 0.749358
## iter 20 value 0.749193
## iter 21 value 0.748115
## iter 22 value 0.747874
## iter 23 value 0.747792
## iter 24 value 0.747757
## iter 25 value 0.747755
## iter 26 value 0.747754
## iter 27 value 0.747754
## iter 27 value 0.747754
## iter 27 value 0.747754
## final value 0.747754
## converged
## initial value 0.753023
## iter 2 value 0.751907
## iter 3 value 0.750952
## iter 4 value 0.749253
## iter 5 value 0.748278
## iter 6 value 0.747631
## iter 7 value 0.747253
## iter 8 value 0.742399
## iter 9 value 0.741944
## iter 10 value 0.741476
## iter 11 value 0.741391
## iter 12 value 0.741246
## iter 13 value 0.741176
## iter 14 value 0.740979
## iter 15 value 0.740830
## iter 16 value 0.740762
## iter 17 value 0.740756
## iter 18 value 0.740756
## iter 18 value 0.740756
## iter 18 value 0.740756
## final value 0.740756
## converged
## $fit
##
## Call:
## stats::arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D,
## Q), period = S), include.mean = !no.constant, optim.control = list(trace = trc,
## REPORT = 1, reltol = tol))
##
## Coefficients:
## ar1 ar2 ma1 ma2
## 0.2938 -0.9191 -0.4733 1.0000
## s.e. 0.0382 0.0381 0.0270 0.0527
##
## sigma^2 estimated as 4.205: log likelihood = -246.21, aic = 502.41
##
## $degrees_of_freedom
## [1] 111
##
## $ttable
## Estimate SE t.value p.value
## ar1 0.2938 0.0382 7.6856 0
## ar2 -0.9191 0.0381 -24.1349 0
## ma1 -0.4733 0.0270 -17.5042 0
## ma2 1.0000 0.0527 18.9879 0
##
## $AIC
## [1] 2.505817
##
## $AICc
## [1] 2.527995
##
## $BIC
## [1] 1.601293
## initial value 0.828263
## iter 2 value 0.775973
## iter 3 value 0.773992
## iter 4 value 0.773947
## iter 5 value 0.773947
## iter 5 value 0.773947
## iter 5 value 0.773947
## final value 0.773947
## converged
## initial value 0.773687
## iter 2 value 0.773685
## iter 3 value 0.773684
## iter 3 value 0.773684
## iter 3 value 0.773684
## final value 0.773684
## converged
## $fit
##
## Call:
## stats::arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D,
## Q), period = S), include.mean = !no.constant, optim.control = list(trace = trc,
## REPORT = 1, reltol = tol))
##
## Coefficients:
## ar1 ar2
## -0.2344 -0.2692
## s.e. 0.0911 0.0914
##
## sigma^2 estimated as 4.691: log likelihood = -249.96, aic = 505.92
##
## $degrees_of_freedom
## [1] 113
##
## $ttable
## Estimate SE t.value p.value
## ar1 -0.2344 0.0911 -2.5737 0.0114
## ar2 -0.2692 0.0914 -2.9446 0.0039
##
## $AIC
## [1] 2.580526
##
## $AICc
## [1] 2.599798
##
## $BIC
## [1] 1.628264
Similar AICc, but we want a model with no MA terms. So use the 2nd.
## ar1 ar2
## -0.2343787 -0.2691571
Filter the time series x using the model for y
Google Trends and Flutracking data do not appear to be correlated, so we shouldn’t use the Google Trends data to predict the Flutracking data.
We should also try fitting a model to the vaccinated survey respondents (i.e. % ofvaccinated survey respondents who report flu-related symptoms during a given week).
## [1] 3e-04
## [1] 0.02883333
Not stationary (why?), so difference by one lag
There is an outlier. Difference by 2 lags?
Same outlier. We will use Dz for now, but this is bad.
## [1] 0.93
## [1] "No"
## [1] 0.95
## [1] "Yes"
There is still a seasonal effect every 23 weeks. So let’s account for seasonality.
## [1] 0.9555556
## [1] "Yes"
## [1] 0.9666667
## [1] "Yes"
Looking at the seasons (every 23 lags), the ACF has a spike at 23 (i.e. Q=1). We can say either the ACF tails off, in which case we can choose a value for P (say P=1, as there is a spike in the PACF at 23); or we can say the ACF cuts off, in which case P=0 by definition. Looking within each 23 week season, we can say that: the ACF cuts off at 1 (q=1 & p=0); we can say the ACF trails off with a spike at 1 and the PACF trails off with a spike at 1 (q=1 & p=1); we can say that the PACF cuts off at 1 (p=1 & q=0). Also consider possible spikes in the ACF and PACF at lag 9.
Cases where P=0 and Q=1
## initial value -5.664793
## iter 2 value -5.743848
## iter 3 value -5.748846
## iter 4 value -5.749018
## iter 5 value -5.749020
## iter 5 value -5.749020
## iter 5 value -5.749020
## final value -5.749020
## converged
## initial value -5.761476
## iter 2 value -5.768288
## iter 3 value -5.774187
## iter 4 value -5.774277
## iter 5 value -5.774288
## iter 5 value -5.774288
## iter 5 value -5.774288
## final value -5.774288
## converged
## $fit
##
## Call:
## stats::arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D,
## Q), period = S), include.mean = !no.constant, optim.control = list(trace = trc,
## REPORT = 1, reltol = tol))
##
## Coefficients:
## ma1 sma1
## -0.2184 -0.6634
## s.e. 0.0875 0.2352
##
## sigma^2 estimated as 8.367e-06: log likelihood = 396.34, aic = -786.67
##
## $degrees_of_freedom
## [1] 113
##
## $ttable
## Estimate SE t.value p.value
## ma1 -0.2184 0.0875 -2.4959 0.0140
## sma1 -0.6634 0.2352 -2.8206 0.0057
##
## $AIC
## [1] -10.65648
##
## $AICc
## [1] -10.63721
##
## $BIC
## [1] -11.60875
## initial value -5.666641
## iter 2 value -5.753119
## iter 3 value -5.757444
## iter 4 value -5.757500
## iter 4 value -5.757500
## iter 4 value -5.757500
## final value -5.757500
## converged
## initial value -5.769069
## iter 2 value -5.777173
## iter 3 value -5.783975
## iter 4 value -5.784106
## iter 5 value -5.784122
## iter 6 value -5.784122
## iter 6 value -5.784122
## final value -5.784122
## converged
## $fit
##
## Call:
## stats::arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D,
## Q), period = S), include.mean = !no.constant, optim.control = list(trace = trc,
## REPORT = 1, reltol = tol))
##
## Coefficients:
## ar1 sma1
## -0.2828 -0.6942
## s.e. 0.1005 0.2545
##
## sigma^2 estimated as 8.062e-06: log likelihood = 397.23, aic = -788.46
##
## $degrees_of_freedom
## [1] 113
##
## $ttable
## Estimate SE t.value p.value
## ar1 -0.2828 0.1005 -2.8139 0.0058
## sma1 -0.6942 0.2545 -2.7276 0.0074
##
## $AIC
## [1] -10.69359
##
## $AICc
## [1] -10.67432
##
## $BIC
## [1] -11.64585
Cases where P=1 and Q=1
## $pred
## Time Series:
## Start = 116
## End = 125
## Frequency = 1
## [1] 0.01576768 0.01586229 0.01998301 0.01999209 0.02184931 0.02258010
## [7] 0.02148729 0.02312899 0.02208322 0.02407620
##
## $se
## Time Series:
## Start = 116
## End = 125
## Frequency = 1
## [1] 0.002879329 0.003534874 0.004205850 0.004753199 0.005251460
## [6] 0.005704361 0.006124389 0.006517262 0.006887800 0.007239387
Let’s try to fit a model omitting the final 2 values, then predict the next 2 values and compare to the actual ones
## $pred
## Time Series:
## Start = 96
## End = 115
## Frequency = 1
## [1] 0.02039634 0.02362814 0.02379617 0.02237217 0.02370805 0.02305938
## [7] 0.02507894 0.02461871 0.02352787 0.02425673 0.02708805 0.02954827
## [13] 0.03370264 0.03512297 0.03097724 0.02989696 0.02906867 0.02189291
## [19] 0.01863522 0.01597923
##
## $se
## Time Series:
## Start = 96
## End = 115
## Frequency = 1
## [1] 0.003129429 0.003771994 0.004492737 0.005061512 0.005587611
## [6] 0.006063709 0.006506443 0.006920485 0.007311249 0.007682121
## [11] 0.008035908 0.008374758 0.008700423 0.009014329 0.009317667
## [16] 0.009611435 0.009896487 0.010173556 0.010443276 0.010706202
## [1] 0.017 0.018 0.019 0.018 0.020 0.017 0.020 0.021 0.022 0.019 0.021
## [12] 0.020 0.023 0.023 0.025 0.024 0.023 0.021 0.018 0.018
Do the actual 114th and 115th values of the time series lie within the 95% bounds of our predicted 114th and 115th values? (Make this look nicer)
## [,1] [,2] [,3]
## [1,] 0.014262655 0.017 0.02653002
## [2,] 0.016235032 0.018 0.03102125
## [3,] 0.014990407 0.019 0.03260194
## [4,] 0.012451608 0.018 0.03229274
## [5,] 0.012756331 0.020 0.03465977
## [6,] 0.011174512 0.017 0.03494425
## [7,] 0.012326316 0.020 0.03783157
## [8,] 0.011054559 0.021 0.03818286
## [9,] 0.009197817 0.022 0.03785791
## [10,] 0.009199774 0.019 0.03931369
## [11,] 0.011337673 0.021 0.04283843
## [12,] 0.013133747 0.020 0.04596280
## [13,] 0.016649809 0.023 0.05075547
## [14,] 0.017454887 0.023 0.05279106
## [15,] 0.012714613 0.025 0.04923987
## [16,] 0.011058547 0.024 0.04873537
## [17,] 0.009671557 0.023 0.04846579
## [18,] 0.001952743 0.021 0.04183308
## [19,] -0.001833599 0.018 0.03910404
## [20,] -0.005004931 0.018 0.03696338
Yes (above shows the lower CI, actual value, upper CI) Plotting actual vs predicted:
As you can see from this plot, the next two values the time series (red line) predicts are close to the actual values (black line). The predicted values are slighly lower than expected, but the actual values do lie within the 95% confidence bounds (dotted red lines).